Co-Occurrence Vectors From Corpora Vs. Distance Vectors From Dictionaries

Authors

  • Yoshiki Niwa
  • Yoshihiko Nitta
Abstract

A comparison was made of vectors derived by using ordinary co-occurrence statistics from large text corpora and of vectors derived by measuring the inter-word distances in dictionary definitions. The precision of word sense disambiguation by using co-occurrence vectors from the 1987 Wall Street Journal (20M total words) was higher than that by using distance vectors from the Collins English Dictionary (60K head words + 1.6M definition words). However, other experimental results suggest that distance vectors contain some semantic information different from that in co-occurrence vectors.

1 Introduction

Word vectors reflecting word meanings are expected to enable numerical approaches to semantics. Some early attempts at vector representation in psycholinguistics were the semantic differential approach (Osgood et al. 1957) and the associative distribution approach (Deese 1962). However, they were derived manually through psychological experiments. An early attempt at automation was made by Wilks et al. (1990) using co-occurrence statistics. Since then, there have been some promising results from using co-occurrence vectors, such as word sense disambiguation (Schütze 1993) and word clustering (Pereira et al. 1993). However, using co-occurrence statistics requires a huge corpus that covers even most rare words. We recently developed word vectors that are derived from an ordinary dictionary by measuring the inter-word distances in the word definitions (Niwa and Nitta 1993). This method, by its nature, has no problem handling rare words. In this paper we examine the usefulness of these distance vectors as semantic representations by comparing them with co-occurrence vectors.

2 Distance Vectors

A reference network of the words in a dictionary (Fig. 1) is used to measure the distance between words. The network is a graph that shows which words are used in the definition of each word (Nitta 1988). The network shown in Fig. 1 is for a very small portion of the reference network for the Collins English Dictionary (1979 edition) on CD-ROM (Liberman 1991), with 60K head words + 1.6M definition words.

[Figure 1 omitted: a small graph in which dictionary is linked to book, word, language, and alphabetical; the origin words unit (O1), book (O2), and people (O3) are marked.]
Fig. 1. Portion of a reference network.

For example, the definition for dictionary is "a book in which the words of a language are listed alphabetically ...". The word dictionary is thus linked to the words book, word, language, and alphabetical. A word vector is defined as the list of distances from a word to a certain set of selected words, which we call origins. The words in Fig. 1 marked with Oi (unit, book, and people) are assumed to be origin words. In principle, origin words can be freely chosen. In our experiments we used middle-frequency words: the 51st to 1050th most frequent words in the reference Collins English Dictionary (CED). The distance vector for dictionary is derived as follows:

  dictionary → ( 2   ... distance(dict., O1)
                 1   ... distance(dict., O2)
                 2   ... distance(dict., O3) )

The i-th element is the distance (the length of the shortest path) between dictionary and the i-th origin, Oi. To begin, we assume every link has a constant length of 1. The actual definition of link length will be given later. If word A is used in the definition of word B, these words are expected to be strongly related. This is the basis of our hypothesis that the distances in the reference network reflect the associative distances between words (Nitta 1993).
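The construction described above, distances measured as shortest paths from a word to each origin word over the reference network, can be sketched as a breadth-first search over a toy graph. The network, link words, and origin set below are illustrative stand-ins, not the paper's actual CED data:

```python
from collections import deque

# Toy reference network: each headword maps to the words used in its
# definition. Links are treated as undirected with constant length 1,
# matching the paper's initial simplification.
network = {
    "dictionary":    ["book", "word", "language", "alphabetical"],
    "book":          ["people", "writing"],
    "word":          ["unit", "language", "communication"],
    "language":      ["communication", "people"],
    "alphabetical":  ["word"],
    "unit":          [],
    "people":        [],
    "writing":       [],
    "communication": [],
}

# Build an undirected adjacency map from the definition links.
adj = {w: set() for w in network}
for head, defn_words in network.items():
    for d in defn_words:
        adj[head].add(d)
        adj[d].add(head)

def distances_from(word):
    """Hop-count (shortest-path) distance from `word` to every reachable word."""
    dist = {word: 0}
    queue = deque([word])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Origin words O1, O2, O3 as marked in Fig. 1.
origins = ["unit", "book", "people"]

def distance_vector(word):
    """Distance vector: distances from `word` to each origin, in order."""
    dist = distances_from(word)
    return [dist.get(o) for o in origins]

print(distance_vector("dictionary"))  # → [2, 1, 2]
```

One BFS per headword suffices to fill in its whole vector, so building vectors for all 60K head words against 1,000 origins stays cheap; the paper's later refinement of variable link lengths would replace the unit hop count with a weighted shortest path.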


Similar resources

Exploring the Validity of Corpus-derived Measures of Semantic Similarity

Lexical co-occurrence counts from large corpora have been used to construct high-dimensional vector-space models of language. In this type of model, words are represented as vectors (or points) in a hyperspace, and distances between word vectors are generally considered to reflect semantic similarity. Two issues must be addressed if a vector-space model is to be used as a 'semantic' measuring dev...

Experimenting with Extracting Lexical Dictionaries from Comparable Corpora for English-Romanian language pair

The paper describes a tool developed in the context of the ACCURAT project (Analysis and evaluation of Comparable Corpora for Under Resourced Areas of machine Translation). The purpose of the tool is to extract bilingual lexical dictionaries (word-to-word) from comparable corpora which do not have to be aligned at any level (document, paragraph, etc.). The method implemented in this tool is intr...

Improving Word Sense Discrimination with Gloss Augmented Feature Vectors

This paper presents a method of unsupervised word sense discrimination that augments co-occurrence feature vectors derived from raw untagged corpora with information from the glosses found in a machine-readable dictionary. Each content word that occurs in the context of a target word to be discriminated is represented by a co-occurrence feature vector. Each of these vectors is augmented with th...

Nonlocal Language Modeling based on Context Co-occurrence Vectors

This paper presents a novel nonlocal language model which utilizes contextual information. A reduced vector space model calculated from co-occurrences of word pairs provides word co-occurrence vectors. The sum of word co-occurrence vectors represents the context of a document, and the cosine similarity between the context vector and the word co-occurrence vectors represents the long-distance lex...
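The context-vector scheme this abstract describes, summing word co-occurrence vectors into a context vector and scoring candidate words by cosine similarity against it, can be sketched with toy data. The words and vector values below are illustrative only, not taken from that paper:

```python
import numpy as np

# Toy co-occurrence vectors (rows: words, columns: context features).
# Illustrative numbers only.
cooc = {
    "bank":  np.array([3.0, 0.0, 1.0, 4.0]),
    "money": np.array([4.0, 0.0, 0.0, 5.0]),
    "river": np.array([0.0, 5.0, 3.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Context of a document = sum of the co-occurrence vectors of its words.
document = ["bank", "money"]
context = sum(cooc[w] for w in document)

# Long-distance lexical preference: score each candidate word by cosine
# similarity between the context vector and its co-occurrence vector.
for w in cooc:
    print(w, round(cosine(context, cooc[w]), 3))
```

Candidates whose co-occurrence profile matches the accumulated context ("money" after "bank" here) score close to 1, while topically distant words ("river") score low, which is what lets the model reward contextually coherent continuations beyond the n-gram window.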

Term Representation with Generalized Latent Semantic Analysis

Document indexing and representation of term-document relations are very important issues for document clustering and retrieval. In this paper, we present Generalized Latent Semantic Analysis as a framework for computing semantically motivated term and document vectors. Our focus on term vectors is motivated by the recent success of co-occurrence based measures of semantic similarity obtained fr...


Publication date: 1994